Abstract:
The creation of an image from another image or from different types of data, including text, scene graphs, and object layouts, is one of the most challenging tasks in computer vision. In addition, capturing images of an object or product from different views can be exhausting and expensive to do manually. With deep learning and artificial intelligence techniques, generating new images from different types of data has now become possible, and significant effort has recently been devoted to developing image generation strategies, with great achievements. To that end, we present in this paper, to the best of the authors' knowledge, the first comprehensive overview of existing image generation methods. Each image generation technique is described based on the nature of the adopted algorithms, the type of data used, and the main objective, and each image generation category is discussed by presenting the proposed approaches. In addition, existing image generation datasets are presented. The evaluation metrics suitable for each image generation category are discussed, and a comparison of the performance of existing solutions is provided to characterize the state of the art and identify its limitations and strengths. Lastly, the current challenges facing this subject are presented.

Abstract:
Since existing locally controllable text-to-image generation methods cannot achieve satisfactory detail, a novel locally controllable text-to-image generation network based on visual-linguistic relation alignment is proposed. The goal of the method is to perform image manipulation and generation semantically through text guidance. The proposed method exploits the relationship between text and image to achieve local control of text-to-image generation. Visual-linguistic matching learns similarity weights between image and text from semantic features to establish fine-grained correspondences between local image regions and words. An instance-level optimization function is introduced into the generation process to accurately control the weights with low similarity and combine them with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve the details of the text and the local regions of the image. Extensive experiments demonstrate that the proposed method achieves superior performance and enables more accurate control over the original image.
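
A hedged sketch of how such fine-grained word-region similarity weights are commonly computed (in the spirit of DAMSM-style attention); the tensor shapes, names, and softmax temperature below are illustrative assumptions, not this paper's actual implementation.

```python
# Illustrative sketch of word-region similarity weighting (DAMSM-style),
# assuming region features from an image encoder and word features from a
# text encoder; shapes and names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def word_region_similarity(words, regions, gamma=5.0):
    """words: (B, T, D) word embeddings; regions: (B, N, D) region features.
    Returns attention weights (B, T, N) aligning each word to image regions."""
    words = F.normalize(words, dim=-1)
    regions = F.normalize(regions, dim=-1)
    sim = torch.bmm(words, regions.transpose(1, 2))   # (B, T, N) cosine similarities
    attn = F.softmax(gamma * sim, dim=-1)             # sharpen and normalize over regions
    return attn

# Each word's attended visual context: weighted sum of region features.
B, T, N, D = 2, 12, 64, 256
w, r = torch.randn(B, T, D), torch.randn(B, N, D)
context = torch.bmm(word_region_similarity(w, r), r)  # (B, T, D)
```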

Abstract:
We review the existing literature on generating text from visual data under the cross-modal generation umbrella, which allows us to compare and contrast various approaches that take visual data as input and produce text outputs, without limiting the analysis to narrowly defined areas such as image captioning. We provide a breakdown of image-to-text generation methods into generative and non-generative image captioning and visual dialogue, with further distinctions for relevant areas. We provide template methods and discuss the existing research in light of these templates, highlighting both the salient commonalities between different approaches and significant departures from the templates. Where it is of interest, we also compare templates across distinct areas. To achieve a comprehensive review, we focus on research papers published at 8 leading machine learning conferences in the years 2016–2021, as well as a number of papers that do not conform to our search criteria but nonetheless come from leading venues. This is the first review we know of to provide a systematic description of the current state of image-to-text generation and to tie distinct research areas together by looking at them through the lens of cross-modal generation.

Abstract:
Synthesizing photographic images from given text descriptions is a challenging problem. Although current methods first synthesize an initial blurred image and then refine it into a high-quality one, most existing methods struggle to refine the initial image into an image that corresponds to the text description. In this paper, Multi-resolution Parallel Generative Adversarial Networks for Text-to-Image Synthesis (MRP-GAN) is proposed to generate photographic images. MRP-GAN introduces a multi-resolution parallel structure to refine the initial images when they are not synthesized well; the low-resolution semantics are maintained throughout the whole process by this structure. A Response Gate is designed to fully exploit the capability of the multi-resolution parallel structure by aggregating the outputs of the parallel subnetworks. We also utilize an attention mechanism, named Residual Attention Network, to fine-tune the fine-grained details of the generated images. We evaluate our MRP-GAN model on the CUB and MS-COCO datasets. Extensive experiments demonstrate the state-of-the-art performance of MRP-GAN. Besides, we apply the multi-resolution parallel structure in an existing method to verify its transferability. (c) 2021 Published by Elsevier B.V.
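
One plausible minimal form of such a gated aggregation of parallel resolution branches is sketched below; the sigmoid-gated blend and layer names are assumptions for illustration, not MRP-GAN's published Response Gate.

```python
# Illustrative gated aggregation of two parallel resolution branches,
# assuming feature maps with the same channel count; the sigmoid-gated
# convex combination is an assumption, not MRP-GAN's exact Response Gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResponseGateSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv predicts a per-pixel gate from the concatenated branches.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, low_res_feat, high_res_feat):
        # Upsample the low-resolution branch to match the high-resolution one.
        low_up = F.interpolate(low_res_feat, size=high_res_feat.shape[-2:],
                               mode='bilinear', align_corners=False)
        g = torch.sigmoid(self.gate(torch.cat([low_up, high_res_feat], dim=1)))
        return g * high_res_feat + (1 - g) * low_up  # gated blend of branches

gate = ResponseGateSketch(channels=64)
out = gate(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 64, 64))  # (1, 64, 64, 64)
```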

Abstract:
With the advent of generative adversarial networks, synthesizing images from text descriptions has recently become an active research area. It is a flexible and intuitive way to perform conditional image generation, with significant progress in recent years regarding visual realism, diversity, and semantic alignment. However, the field still faces several challenges that require further research efforts, such as enabling the generation of high-resolution images with multiple objects and developing suitable and reliable evaluation metrics that correlate with human judgement. In this review, we contextualize the state of the art of adversarial text-to-image synthesis models and their development since their inception five years ago, and propose a taxonomy based on the level of supervision. We critically examine current strategies for evaluating text-to-image synthesis models, highlight shortcomings, and identify new areas of research, ranging from the development of better datasets and evaluation metrics to possible improvements in architectural design and model training. This review complements previous surveys on generative adversarial networks with a focus on text-to-image synthesis, which we believe will help researchers to further advance the field.
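
Since the review centers on evaluating text-to-image models, a minimal sketch of the widely used Fréchet Inception Distance (FID) follows, assuming Inception activations have already been extracted for the real and generated images; the feature-extraction step itself is omitted.

```python
# Minimal FID sketch: Fréchet distance between two Gaussians fitted to
# Inception activations of real and generated images. Feature extraction
# (an InceptionV3 forward pass) is assumed to have been done already.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    """real_feats, fake_feats: (N, 2048) arrays of Inception activations."""
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean)

score = fid(np.random.randn(500, 2048), np.random.randn(500, 2048))
```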

Abstract:
Generative adversarial networks (GANs) have demonstrated remarkable potential in the realm of text-to-image synthesis. Nevertheless, conventional GANs employing conditional latent space interpolation and manifold interpolation (GAN-INT-CLS) encounter challenges in generating images that accurately reflect the given text descriptions. To overcome these limitations, we introduce TextControlGAN, a controllable GAN-based model specifically designed for text-to-image synthesis tasks. In contrast to traditional GANs, TextControlGAN incorporates a neural network structure, known as a regressor, to effectively learn features from conditional texts. To further enhance the learning performance of the regressor, data augmentation techniques are employed. As a result, the generator within TextControlGAN can learn conditional texts more effectively, leading to the production of images that more closely adhere to the textual conditions. Furthermore, by concentrating the discriminator's training efforts exclusively on GAN training, the overall quality of the generated images is significantly improved. Evaluations conducted on the Caltech-UCSD Birds-200 (CUB) dataset demonstrate that TextControlGAN surpasses the performance of the cGAN-based GAN-INT-CLS model, achieving a 17.6% improvement in Inception Score (IS) and a 36.6% reduction in Fréchet Inception Distance (FID). In supplementary experiments utilizing 128 × 128 resolution images, TextControlGAN exhibits a remarkable ability to manipulate minor features of the generated bird images according to the given text descriptions. These findings highlight the potential of TextControlGAN as a powerful tool for generating high-quality, text-conditioned images, paving the way for future advancements in the field of text-to-image synthesis.
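
A hedged sketch of the general regressor idea: a discriminator head that predicts the text-condition embedding back from an image, giving the generator an extra conditioning loss. The layer sizes, the L2 regression loss, and its weighting are illustrative assumptions, not TextControlGAN's exact design.

```python
# Illustrative auxiliary-regressor setup: alongside the real/fake logit, a
# regressor head predicts the text embedding back from the image, and the
# generator is penalized when its output does not match the condition.
import torch
import torch.nn as nn

class DiscriminatorWithRegressor(nn.Module):
    def __init__(self, feat_dim=512, text_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a conv feature extractor
            nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim), nn.LeakyReLU(0.2))
        self.adv_head = nn.Linear(feat_dim, 1)          # real/fake logit
        self.regressor = nn.Linear(feat_dim, text_dim)  # predicts text embedding

    def forward(self, img):
        h = self.backbone(img)
        return self.adv_head(h), self.regressor(h)

D = DiscriminatorWithRegressor()
fake_img, text_emb = torch.randn(4, 3, 64, 64), torch.randn(4, 128)
logit, pred_emb = D(fake_img)
# Generator-side losses: fool the discriminator and match the condition.
g_adv = nn.functional.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
g_reg = nn.functional.mse_loss(pred_emb, text_emb)
g_loss = g_adv + 1.0 * g_reg  # loss weighting is a placeholder assumption
```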

Abstract:
Altering the content of an image with photo editing tools is a tedious task for an inexperienced user, especially when modifying the visual attributes of a specific object in an image without affecting other constituents such as the background. To simplify the process of image manipulation and to provide more control to users, it is better to utilize a simpler interface such as natural language, which also enables users to semantically modify parts of an image according to a given text. Therefore, in this paper, we address the challenge of manipulating images using natural language descriptions. We propose the Two-sidEd Attentive conditional Generative Adversarial Network (TEA-cGAN) to generate semantically manipulated images. TEA-cGAN's contribution is two-fold. The first contribution aims to attend to the locations that need to be modified during generation. It introduces two types of architectures that provide fine-grained attention in both the generator and the discriminator of a Generative Adversarial Network (GAN). To be specific, the first, a single-scale architecture used in the generator, focuses on modifying only the text-relevant regions of an image and leaves other regions untouched, while the second, a multi-scale architecture, extends this idea by taking different scales of image features into account. The second contribution is the generation of higher-resolution images (e.g., 256 x 256), as they provide better quality and stability. Quantitative and qualitative experiments conducted on the CUB and Oxford-102 datasets confirm that both TEA-cGAN architectures outperform existing methods when generating 128 x 128 images as well as higher-resolution 256 x 256 images. (c) 2020 Elsevier Ltd. All rights reserved.
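
The core idea of editing only text-relevant regions can be sketched as a text-conditioned spatial mask that blends edited and original pixels; the mask computation and blending rule below are illustrative assumptions, not TEA-cGAN's actual layers.

```python
# Illustrative text-conditioned spatial mask: score image features against
# the sentence embedding, then blend edited and original pixels so only
# text-relevant regions change. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class TextRegionMask(nn.Module):
    def __init__(self, img_ch=64, text_dim=128):
        super().__init__()
        self.proj = nn.Conv2d(img_ch, text_dim, kernel_size=1)

    def forward(self, feat, text_emb):
        """feat: (B, C, H, W) image features; text_emb: (B, text_dim)."""
        q = self.proj(feat)                               # (B, text_dim, H, W)
        scores = (q * text_emb[:, :, None, None]).sum(1, keepdim=True)
        return torch.sigmoid(scores)                      # (B, 1, H, W) soft mask

mask_net = TextRegionMask()
feat, emb = torch.randn(2, 64, 32, 32), torch.randn(2, 128)
m = mask_net(feat, emb)
original, edited = torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32)
m_img = nn.functional.interpolate(m, size=(32, 32))      # match image resolution
output = m_img * edited + (1 - m_img) * original         # edit only masked regions
```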

Abstract:
Composing Text and Image to Image Retrieval (CTI-IR) is an emerging task in computer vision that retrieves images relevant to a query image together with text describing desired modifications to that image. Most conventional cross-modal retrieval approaches take data of one modality as the query to retrieve relevant data of another modality. Different from existing methods, in this article we propose an end-to-end trainable network for simultaneous image generation and CTI-IR. The proposed model is based on the Generative Adversarial Network (GAN) and enjoys several merits. First, it can learn a generative and discriminative feature for the query (a query image with a text description) by jointly training a generative model and a retrieval model. Second, our model can automatically manipulate the visual features of the reference image according to the text description through adversarial learning between the synthesized image and the target image. Third, global-local collaborative discriminators and attention-based generators are exploited, allowing our approach to focus on both the global and local differences between the query image and the target image. As a result, the semantic consistency and fine-grained details of the generated images are better enhanced in our model. The generated image can also be used to interpret and empower our retrieval model. Quantitative and qualitative evaluations on three benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
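
A hedged sketch of the common pattern for composing an image query with modification text into a single retrieval feature (a gated residual fusion in the spirit of TIRG); the fusion rule is an illustrative assumption, not this paper's exact network.

```python
# Illustrative composition of a query image feature with a modification-text
# feature for retrieval: gated residual fusion (TIRG-style), with retrieval
# by cosine similarity. The fusion form is an assumption, not this paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposeQuery(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.res = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, img_feat, txt_feat):
        x = torch.cat([img_feat, txt_feat], dim=-1)
        return self.gate(x) * img_feat + self.res(x)  # keep image, add text edit

composer = ComposeQuery()
q = composer(torch.randn(4, 512), torch.randn(4, 512))             # composed queries
gallery = torch.randn(100, 512)                                    # candidate image features
scores = F.normalize(q, dim=-1) @ F.normalize(gallery, dim=-1).T   # (4, 100)
ranked = scores.argsort(dim=-1, descending=True)                   # retrieval ranking
```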

Abstract:
This article offers a comprehensive review of the research on Natural Language Generation (NLG) over the past two decades, especially in relation to deep learning methods for data-to-text and text-to-text generation, as well as new applications of NLG technology. This survey aims to (a) give the latest synthesis of deep learning research on the core NLG tasks, as well as the architectures adopted in the field; (b) meticulously and comprehensively detail the various NLG tasks and datasets, and draw attention to the challenges in NLG evaluation, focusing on different evaluation methods and their relationships; and (c) highlight future directions and relatively recent research issues that arise from the increasing synergy between NLG and other artificial intelligence areas, such as computer vision, text, and computational creativity.

Abstract:
Text-to-image generation aims to generate images from text descriptions. Its main challenge lies in two aspects: (1) semantic consistency, i.e., the generated images should be semantically consistent with the input text; and (2) visual reality, i.e., the generated images should look like real images. To ensure text-image consistency, existing works mainly learn to establish cross-modal representations via a text encoder and an image encoder. However, due to the limited representation capability of fixed-length embeddings and the flexibility of free-form text descriptions, the learned text-to-image model is incapable of maintaining semantic consistency between local image regions and fine-grained descriptions. As a result, the generated images sometimes miss fine-grained attributes of the generated object, such as the color or shape of a part of the object. To address this issue, this paper proposes a Local Feature Refinement Based Generative Adversarial Network (LFR-GAN), which first divides the text into independent fine-grained attributes and generates an initial image, then refines the image details based on these attributes. The main contributions are three-fold: (1) An attribute modeling approach is proposed to model fine-grained text descriptions by mapping them into representations of independent attributes, which provides more fine-grained details for image generation. (2) A local feature refinement approach is proposed to enable the generated image to fully reflect the fine-grained attributes contained in the text description. (3) A multi-stage generation approach is proposed to realize fine-grained manipulation of complex images progressively, which aims to improve the performance of the refinement and generate photo-realistic images. Extensive experiments on the CUB and Oxford102 datasets show the effectiveness of our LFR-GAN approach in both text-to-image generation and text-guided image manipulation tasks. Our LFR-GAN approach shows superior performance compared to state-of-the-art methods. The code will be released at https://github.com/PKU-ICST-MIPL/LFR-GAN_TOMM2023.
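
A hedged skeleton of the pipeline the abstract describes: split the caption into attribute phrases, generate an initial image, then refine it once per attribute. The phrase-splitting heuristic and the generator/refiner stubs are placeholder assumptions, not LFR-GAN's released code.

```python
# Illustrative pipeline skeleton: decompose a caption into attribute phrases,
# generate an initial image, then refine it attribute by attribute.
import re

def split_attributes(caption):
    """Crude heuristic: split on commas and conjunctions into attribute phrases."""
    parts = re.split(r',| and ', caption.lower())
    return [p.strip() for p in parts if p.strip()]

def generate_initial(caption):
    return {"image": "init", "from": caption}  # stand-in for a GAN forward pass

def refine_local(image, attribute):
    image.setdefault("refined_with", []).append(attribute)  # stand-in refinement stage
    return image

caption = "this bird has a red crown, a short yellow beak and white wings"
image = generate_initial(caption)
for attr in split_attributes(caption):  # multi-stage, per-attribute refinement
    image = refine_local(image, attr)
print(image["refined_with"])
# ['this bird has a red crown', 'a short yellow beak', 'white wings']
```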